Pluralistic Image Completion
Most image completion methods produce only one result for each masked input,
although there may be many reasonable possibilities. In this paper, we present
an approach for pluralistic image completion: the task of generating
multiple and diverse plausible solutions for image completion. A major
challenge for learning-based approaches is that there is usually only one
ground-truth training instance per label. As such, sampling from conditional VAEs
still leads to minimal diversity. To overcome this, we propose a novel and
probabilistically principled framework with two parallel paths. One is a
reconstructive path that uses the single given ground truth to obtain the prior
distribution of the missing parts and rebuild the original image from this
distribution. The other is a generative path for which the conditional prior is
coupled to the distribution obtained in the reconstructive path. Both are
supported by GANs. We also introduce a new short+long term attention layer that
exploits distant relations among decoder and encoder features, improving
appearance consistency. When tested on datasets with buildings (Paris), faces
(CelebA-HQ), and natural images (ImageNet), our method not only generated
higher-quality completion results but also produced multiple and diverse
plausible outputs.
Comment: 21 pages, 16 figures
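The coupling between the two paths can be illustrated with a minimal Python sketch. It assumes, as is common for VAE-style models but not stated in the abstract, that both the reconstructive-path distribution and the generative-path conditional prior are diagonal Gaussians, so their coupling can be measured with a closed-form KL divergence. The KL direction and the names `gaussian_kl` and `sample_latents` are illustrative assumptions, not the authors' code. Drawing several latent codes from the prior is what would yield several different completions of one masked image.

```python
import math
import random
from typing import List, Sequence


def gaussian_kl(mu_q: Sequence[float], sigma_q: Sequence[float],
                mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions.

    Assumption: q plays the role of the reconstructive-path distribution and
    p the generative-path conditional prior that is coupled to it.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def sample_latents(mu: Sequence[float], sigma: Sequence[float],
                   n: int, seed: int = 0) -> List[List[float]]:
    """Draw n latent codes z = mu + sigma * eps (reparameterization trick).

    Each code would be decoded into a different completion of the same input.
    """
    rng = random.Random(seed)
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
            for _ in range(n)]


if __name__ == "__main__":
    prior_mu, prior_sigma = [0.0, 0.0], [1.0, 1.0]
    posterior_mu, posterior_sigma = [0.5, -0.2], [0.8, 1.1]
    print(gaussian_kl(posterior_mu, posterior_sigma, prior_mu, prior_sigma))
    for z in sample_latents(prior_mu, prior_sigma, n=3):
        print(z)
```

Driving the KL term toward zero pulls the generative prior toward the distribution inferred from the full image, while distinct latent samples still decode to distinct outputs.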
IPO-LDM: Depth-aided 360-degree Indoor RGB Panorama Outpainting via Latent Diffusion Model
Generating complete 360-degree panoramas from narrow-field-of-view images is an
ongoing research problem, as omnidirectional RGB data is not readily available.
Existing GAN-based approaches face barriers to achieving higher-quality output
and generalize poorly across different mask types. In this paper,
we present our 360-degree indoor RGB panorama outpainting model using latent
diffusion models (LDM), called IPO-LDM. We introduce a new bi-modal latent
diffusion structure that utilizes both RGB and depth panoramic data during
training but works surprisingly well for outpainting ordinary, depth-free RGB
images during inference. We further propose a novel technique that introduces
progressive camera rotations during each diffusion denoising step, which
substantially improves panorama wraparound consistency.
Results show that our IPO-LDM not only significantly outperforms
state-of-the-art methods on RGB panorama outpainting, but can also produce
multiple and diverse well-structured results for different types of masks.
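The rotation idea rests on a property of equirectangular panoramas: a yaw rotation of the camera is a circular shift of image columns. The minimal Python sketch below shows how such shifts could be interleaved with denoising steps so that the left/right seam moves to a new longitude at every step. It is an assumption-based illustration, not IPO-LDM's implementation: the fixed per-step shift, the single-channel list-of-lists latent, and the `denoise_step` callback are hypothetical.

```python
from typing import Callable, List

# A single-channel equirectangular latent: rows (latitude) x columns (longitude).
Latent = List[List[float]]


def roll_columns(latent: Latent, shift: int) -> Latent:
    """Circularly shift every row right by `shift` columns.

    On an equirectangular panorama, a horizontal circular shift is equivalent
    to rotating the camera about its vertical (yaw) axis.
    """
    width = len(latent[0])
    k = shift % width
    if k == 0:
        return [row[:] for row in latent]
    return [row[-k:] + row[:-k] for row in latent]


def denoise_with_rotation(latent: Latent,
                          denoise_step: Callable[[Latent, int], Latent],
                          num_steps: int,
                          shift_per_step: int) -> Latent:
    """Run a reverse-diffusion loop, rotating the panorama before every step.

    `denoise_step(x, t)` is a hypothetical stand-in for one LDM denoising step.
    Because the accumulated rotation changes each step, no single column stays
    at the image border for the whole loop.
    """
    x = latent
    offset = 0
    for t in reversed(range(num_steps)):
        x = roll_columns(x, shift_per_step)
        offset += shift_per_step
        x = denoise_step(x, t)
    # Undo the accumulated rotation so the result is in the original frame.
    return roll_columns(x, -offset)


if __name__ == "__main__":
    z = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
    print(denoise_with_rotation(z, lambda x, t: x, num_steps=5, shift_per_step=1))
```

With an identity denoiser the output matches the input, which is a quick sanity check that the final inverse rotation restores the original frame.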
T2Net: Synthetic-to-Realistic Translation for Solving Single-Image Depth Estimation Tasks
Current methods for single-image depth estimation use training datasets with
real image-depth pairs or stereo pairs, which are not easy to acquire. We
propose a framework, trained on synthetic image-depth pairs and unpaired real
images, that comprises an image translation network for enhancing realism of
input images, followed by a depth prediction network. A key idea is having the
first network act as a wide-spectrum input translator, taking in either
synthetic or real images, and ideally producing minimally modified realistic
images. This is done via a reconstruction loss when the training input is real
and a GAN loss when it is synthetic, removing the need for heuristic
self-regularization. The second network is trained on a task loss for synthetic
image-depth pairs, with an extra GAN loss to unify real and synthetic feature
distributions. Importantly, the framework can be trained end-to-end, leading to
good results, even surpassing early deep-learning methods that use real paired
data.
Comment: 15 pages, 8 figures
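The domain-switched objective for the translation network can be sketched as follows. This minimal Python version assumes an L1 reconstruction loss for real inputs, the non-saturating generator loss -log D(G(x)) for synthetic inputs, and an L1 depth task loss; the abstract does not specify these choices or their weights, and the function names are illustrative only.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized flat arrays."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def translator_loss(translated: Sequence[float], source: Sequence[float],
                    is_real: bool, disc_prob: float) -> float:
    """Loss for the translation network, switched on the input domain.

    Real input: reconstruction loss, so real images pass through nearly
    unchanged. Synthetic input: non-saturating GAN generator loss
    -log D(G(x)), where `disc_prob` is the discriminator's probability that
    the translated image is real.
    """
    if is_real:
        return l1_loss(translated, source)
    if not 0.0 < disc_prob <= 1.0:
        raise ValueError("disc_prob must be in (0, 1]")
    return -math.log(disc_prob)


def depth_task_loss(pred_depth: Sequence[float], gt_depth: Sequence[float]) -> float:
    """Task loss on synthetic image-depth pairs (L1 here, as an assumption)."""
    return l1_loss(pred_depth, gt_depth)


if __name__ == "__main__":
    print(translator_loss([0.1, 0.2], [0.1, 0.2], is_real=True, disc_prob=0.5))
    print(translator_loss([0.3, 0.4], [0.0, 0.0], is_real=False, disc_prob=0.5))
    print(depth_task_loss([1.0, 2.0], [1.5, 2.5]))
```

The switch is what lets one network act as a wide-spectrum translator: real images are anchored by reconstruction, while synthetic ones are pushed toward realism by the discriminator.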
Visibility Constrained Generative Model for Depth-based 3D Facial Pose Tracking
In this paper, we propose a generative framework that unifies depth-based 3D
facial pose tracking and on-the-fly face model adaptation in unconstrained
scenarios with heavy occlusions and arbitrary facial expression variations.
Specifically, we introduce a statistical 3D morphable model that flexibly
describes the distribution of points on the surface of the face model, with an
efficient switchable online adaptation that gradually captures the identity of
the tracked subject and rapidly constructs a suitable face model when the
subject changes. Moreover, to improve robustness to occlusions, and unlike prior
art that employed ICP-based facial pose estimation, we propose a ray visibility
constraint that regularizes the pose based on the face model's visibility with
respect to the input point cloud. Ablation studies and experimental results on
Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective
and outperforms competing state-of-the-art depth-based methods.
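The intuition behind a ray visibility constraint can be shown with a simplified Python sketch. The abstract does not give the paper's formulation, so this is an assumed illustration: each model point is projected through a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; if the point lies in front of the measured depth at its pixel, the sensor should have seen it, so the pose is penalized. Points behind the measured surface are treated as occluded and left unpenalized, which is what makes this kind of term tolerant of occlusions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def project(point: Point, fx: float, fy: float,
            cx: float, cy: float) -> Tuple[int, int, float]:
    """Pinhole projection of a camera-space point to (column, row, depth)."""
    x, y, z = point
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy)), z


def visibility_penalty(model_points: Sequence[Point],
                       depth_map: Sequence[Sequence[float]],
                       fx: float, fy: float, cx: float, cy: float,
                       tau: float = 0.01) -> float:
    """Sum of squared violations where a model point sits in front of the observed surface.

    `tau` is an assumed tolerance for sensor noise.
    """
    height, width = len(depth_map), len(depth_map[0])
    penalty = 0.0
    for p in model_points:
        if p[2] <= 0:
            continue  # behind the camera
        u, v, z = project(p, fx, fy, cx, cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        observed = depth_map[v][u]
        if observed <= 0:
            continue  # missing depth measurement
        gap = observed - z - tau
        if gap > 0:
            penalty += gap * gap
    return penalty


if __name__ == "__main__":
    depth = [[1.0] * 3 for _ in range(3)]
    points = [(0.0, 0.0, 0.5), (0.0, 0.0, 2.0)]  # one in front, one occluded
    print(visibility_penalty(points, depth, fx=1.0, fy=1.0, cx=1.0, cy=1.0))
```

In a tracker, a term like this would be added to the data term of the pose optimization rather than used on its own.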
Conditional Adversarial Synthesis of 3D Facial Action Units
Employing deep learning-based approaches for fine-grained facial expression
analysis, such as those involving the estimation of Action Unit (AU)
intensities, is difficult due to the lack of a large-scale dataset of real
faces with sufficiently diverse AU labels for training. In this paper, we
consider how AU-level facial image synthesis can be used to substantially
augment such a dataset. We propose an AU synthesis framework that combines the
well-known 3D Morphable Model (3DMM), which intrinsically disentangles
expression parameters from other face attributes, with models that
adversarially generate 3DMM expression parameters conditioned on given target
AU labels, in contrast to the more conventional approach of generating facial
images directly. In this way, we are able to synthesize new combinations of
expression parameters and facial images from desired AU labels. Extensive
quantitative and qualitative results on the benchmark DISFA dataset demonstrate
the effectiveness of our method on 3DMM facial expression parameter synthesis
and data augmentation for deep learning-based AU intensity estimation.
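The disentanglement this framework relies on can be sketched with the standard linear morphable-model formulation, in which a face shape is the mean shape plus identity and expression basis vectors weighted by their coefficients. The minimal Python sketch below assumes that formulation. `toy_expression_generator` is a hypothetical linear stand-in for the adversarially trained, AU-conditioned generator, and all names are illustrative. Because identity and expression coefficients are separate, AU-driven expression parameters can be applied to a fixed identity.

```python
from typing import List, Sequence


def morphable_shape(mean_shape: Sequence[float],
                    id_basis: Sequence[Sequence[float]], id_coeffs: Sequence[float],
                    exp_basis: Sequence[Sequence[float]], exp_coeffs: Sequence[float]) -> List[float]:
    """Linear 3DMM: S = S_mean + sum_i alpha_i * B_id[i] + sum_j beta_j * B_exp[j].

    Shapes are flattened (x1, y1, z1, x2, ...) vertex coordinate vectors.
    """
    shape = [float(v) for v in mean_shape]
    for basis, coeffs in ((id_basis, id_coeffs), (exp_basis, exp_coeffs)):
        for vector, c in zip(basis, coeffs):
            for k, value in enumerate(vector):
                shape[k] += c * value
    return shape


def toy_expression_generator(au_labels: Sequence[float], noise: Sequence[float],
                             weight: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the AU-conditioned generator.

    Maps target AU intensities to expression coefficients with a fixed linear
    map plus noise; a trained adversarial generator would replace this.
    """
    return [sum(w * a for w, a in zip(row, au_labels)) + n
            for row, n in zip(weight, noise)]


if __name__ == "__main__":
    mean = [0.0, 0.0, 0.0]
    id_basis, exp_basis = [[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    identity = [0.5]  # held fixed while the AU labels vary
    for aus in ([1.0, 0.0], [0.0, 1.0]):
        beta = toy_expression_generator(aus, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
        print(aus, morphable_shape(mean, id_basis, identity, exp_basis, beta))
```

Generating expression parameters rather than pixels keeps the search space small and lets each synthesized parameter set be rendered onto many identities for augmentation.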